Diabetes Analysis

Data Reference: https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators

import pandas as pd
import numpy as np
import altair as alt

Summary

This project attempts to predict diabetes status using Logistic Regression and LinearSVC models, compared against a baseline DummyClassifier on an imbalanced dataset. All models achieved similar accuracy on the test set (approximately 0.86), which highlights a key issue: on an imbalanced dataset, accuracy alone is not a reliable performance metric.

These findings motivate deeper exploratory data analysis, evaluation with additional metrics (precision, recall, F1), and exploration of alternative models and threshold tuning to obtain a more robust assessment of the models' predictive performance.

Introduction

Diabetes is a chronic disease that prevents the body from properly controlling blood sugar levels, which can lead to serious health problems including heart disease, vision loss, kidney disease, and limb amputation (Teboul, 2020). Given the severity of the disease, early detection can allow people to make lifestyle changes and receive treatment that can slow disease progression. We believe that machine learning models using survey data can offer a promising way to create accessible, cost-effective screening tools to identify high-risk individuals and support public health efforts.

Research Question

Can we use health indicators and lifestyle factors from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) survey to accurately predict whether an individual has diabetes?

We are looking to:

  1. Build and evaluate classification models that predict diabetes status based on 21 health and lifestyle features
  2. Compare the performance and efficiency of logistic regression and support vector machine (SVM) classifiers
  3. Assess whether survey-based features can provide sufficiently accurate predictions for practical screening applications

Methods & Results

This analysis uses the diabetes_binary_health_indicators_BRFSS2015.csv dataset, a cleaned and preprocessed version of the CDC’s 2015 Behavioral Risk Factor Surveillance System (BRFSS) survey data, made available by Alex Teboul on Kaggle (Teboul, 2020).

For this analysis, we split the dataset into training (80%) and testing (20%) sets using a fixed random state (522) to ensure reproducibility. We implemented two classification algorithms:

  1. Logistic Regression: A linear model appropriate for binary classification that estimates the probability of diabetes based on a linear combination of features.
  2. Linear Support Vector Classifier (SVC): A classifier that finds an optimal hyperplane to separate diabetic from non-diabetic individuals.

Both models were implemented using scikit-learn pipelines that include feature standardization (StandardScaler) to normalize the numeric features to comparable scales. Binary categorical features were already processed in the dataset and were set to pass through the column transformer. We evaluated model performance using cross-validation on the training set and final accuracy assessment on the held-out test set.

Our results show that both models achieve approximately 86% accuracy, with logistic regression demonstrating slightly faster training time.

Discussion

The baseline DummyClassifier achieves an accuracy score of about 0.86 by assigning the most frequent class (non-diabetic) to all patients. This reflects the fact that approximately 86% of the dataset is non-diabetic. Both Logistic Regression and LinearSVC achieve similar accuracy (approximately 0.86), with little to no improvement over the baseline.

The EDA showed a class imbalance (far more non-diabetic than diabetic patients), which may affect the models' reliability. Therefore, more analysis is needed: exploring additional models, evaluating with imbalance-aware metrics such as precision and recall, examining confusion matrices, and testing different data splits or tuning hyperparameters to determine whether performance is stable across scenarios before drawing strong conclusions.
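The imbalance-aware evaluation described above can be sketched with scikit-learn's confusion-matrix and precision/recall utilities. This is a minimal sketch on synthetic data generated to mimic the roughly 86/14 class split; the feature values, sample size, and model settings are illustrative assumptions, not the project's actual data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with a ~86/14 class split, standing in for the BRFSS sample
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.86], random_state=522)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=522)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)

cm = confusion_matrix(y_te, pred)        # rows: true class, columns: predicted class
precision = precision_score(y_te, pred)  # of flagged positives, how many are correct
recall = recall_score(y_te, pred)        # of true positives, how many were found
f1 = f1_score(y_te, pred)
print(cm)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On imbalanced data, minority-class recall typically sits far below the headline accuracy, which is exactly the gap the confusion matrix exposes.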

The similarity in test scores is an unexpected finding. With a clean dataset containing informative and diverse features, we would expect the classification models to perform at least better than the dummy classifier. Additionally, initial hyperparameter tuning for logistic regression did not affect accuracy (data not shown). This finding highlights the importance of understanding the data through EDA to interpret where accuracy scores come from.

This suggests deeper EDA as a next step, including feature distributions, to see whether the classes overlap and whether the models can separate them effectively. Other future questions include determining which features are most important for classifying an individual as diabetic or not, evaluating the models' probability estimates, and assessing whether all features are truly helpful for drawing conclusions.
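One way to approach the feature-importance question is to inspect the coefficients of a logistic regression fitted on standardized features, where larger absolute coefficients suggest more influential features. The sketch below uses a small synthetic frame with hypothetical column names (BMI, GenHlth, Age) and planted effect sizes; it illustrates the technique, not results from the BRFSS data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic frame: "BMI" is given the largest planted effect on the label
rng = np.random.default_rng(522)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["BMI", "GenHlth", "Age"])
y = (2.0 * X["BMI"] + 0.5 * X["GenHlth"] + rng.normal(size=500) > 0).astype(int)

# Standardize first so coefficient magnitudes are comparable across features
model = LogisticRegression(max_iter=1000).fit(StandardScaler().fit_transform(X), y)
importance = pd.Series(np.abs(model.coef_[0]), index=X.columns).sort_values(ascending=False)
print(importance)  # BMI ranks first, matching the planted effect
```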

Analysis

Read in Data

from ucimlrepo import fetch_ucirepo 
   
cdc_diabetes_health_indicators = fetch_ucirepo(id=891) 
  
dat = cdc_diabetes_health_indicators.data.original

Train Test Split

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train_df, test_df = train_test_split(dat, test_size=0.2, random_state=522)

X_train, y_train = (
    train_df.drop(columns=["Diabetes_binary"]),
    train_df["Diabetes_binary"],
)
X_test, y_test = (
    test_df.drop(columns=["Diabetes_binary"]),
    test_df["Diabetes_binary"],
)
train_df.head()
ID Diabetes_binary HighBP HighChol CholCheck BMI Smoker Stroke HeartDiseaseorAttack PhysActivity ... AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income
180125 180125 0 0 0 1 29 1 0 0 1 ... 1 0 3 0 0 0 1 9 6 7
49393 49393 0 1 1 1 26 1 0 0 1 ... 1 0 3 0 0 0 0 9 6 8
86115 86115 0 1 1 1 27 1 0 0 1 ... 1 0 2 0 0 0 1 9 4 5
249968 249968 0 0 0 1 27 0 0 0 1 ... 1 0 3 0 0 1 0 10 6 5
196362 196362 0 1 0 1 28 0 0 0 1 ... 1 0 2 0 0 0 1 8 6 8

5 rows × 23 columns

train_df.tail()
ID Diabetes_binary HighBP HighChol CholCheck BMI Smoker Stroke HeartDiseaseorAttack PhysActivity ... AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income
135498 135498 0 0 0 1 23 0 0 0 1 ... 1 0 1 2 0 0 1 6 6 8
143767 143767 0 1 1 1 28 1 0 0 1 ... 1 0 2 0 0 0 1 10 4 6
68896 68896 0 0 0 1 28 0 0 0 1 ... 1 0 3 0 0 0 1 7 4 5
247659 247659 0 0 0 0 31 0 0 0 1 ... 1 0 1 0 0 0 1 5 4 8
61332 61332 0 1 0 1 30 1 0 0 1 ... 1 0 3 0 0 0 1 4 6 8

5 rows × 23 columns

train_df.shape
(202944, 23)
train_df.columns
Index(['ID', 'Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI',
       'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits',
       'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost',
       'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age',
       'Education', 'Income'],
      dtype='object')

Data Validation

import pointblank as pb
########################## Data Validation: Correct file format

## Checks that the training data has the same number of columns as the
## original dataset (validates the column count)
validation_1_1 = (
    pb.Validate(data=train_df)
    .col_count_match(len(dat.columns))
    .interrogate()
)

## Checks that the training data has correct number of observations/rows
## 80% split for training data from the total of original data instances.
rows, cols = dat.shape
train_target = int(rows * 0.8)

validation_1_2 = (
    pb.Validate(data=train_df)
    .row_count_match(train_target)
    .interrogate()
)

validation_1_1
validation_1_2
[Pointblank validation report: row_count_match on 202,944 rows, 1/1 units passed]
########################## Data Validation: Correct column names
### Check that data contains all required column names and matches the expected schema.
expected_columns = ['ID', 'Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI',
                   'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits',
                   'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost',
                   'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age',
                   'Education', 'Income']
validation_2 = (
    pb.Validate(data = train_df)
    .col_exists(columns = expected_columns)
    .interrogate()
)
validation_2
[Pointblank validation report: col_exists for all 23 expected columns, 23/23 steps passed]
########################## Data Validation: No empty observations
## Checks that all rows are complete and contain no missing values.
validation_3 = (
    pb.Validate(data = train_df)
    .rows_complete() 
    .interrogate()
)
validation_3
[Pointblank validation report: rows_complete across all columns, 203K/203K rows passed]
########################## Data Validation: No empty observations
## Checks that each column is 100% non-missing; the dataset contains no missing values.
threshold = 1  # warning threshold; the dataset has no missing values

validator = pb.Validate(data=train_df)

for col in train_df.columns:
    validator = validator.col_vals_not_null(columns=str(col), thresholds=threshold)

validation_4 = validator.interrogate()
validation_4
[Pointblank validation report: col_vals_not_null for all 23 columns, 203K/203K values non-null in each, all steps passed (step-specific warning thresholds set with W:1)]


numeric_features = ["BMI"]
binary_features = ["HighBP", "HighChol", "CholCheck", "Smoker", "Stroke", 
                   "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies", "HvyAlcoholConsump", 
                   "AnyHealthcare", "NoDocbcCost", "DiffWalk", "Sex"]
ordinal_features = ["GenHlth", "MentHlth", "PhysHlth", "Age", "Education", "Income"]
import pointblank as pb

########################## Data Validation: Correct data types in each column
################ If fails: Critical checks (schema) -> Let it fail naturally and stop the pipeline
schema_columns = [(col, "int64") for col in train_df.columns]
schema = pb.Schema(columns=schema_columns)
(
    pb.Validate(data=train_df)
    .col_schema_match(schema=schema)
    .interrogate()
)
[Pointblank validation report: col_schema_match against the 23-column int64 schema, passed]
########################## Data Validation: No duplicate observations
################ If fails: Non-Critical -> raise warnings and continue
unique_key_cols = ["ID"]  # use only the primary key column "ID"
try:
    (
        pb.Validate(data=train_df)
        .rows_distinct(columns_subset=unique_key_cols)
        .interrogate()
    )
except Exception:
    print("Data Validation failed: Duplicate Observation detected")
########################## Data Validation: No outlier or anomalous values for NUMERIC Features
###### Done by defining acceptable numeric ranges
## (based on the data collection method and domain knowledge)
################ If fails: Non-Critical -> raise warnings and continue
try:
    (
        pb.Validate(data=train_df)
        .col_vals_between(columns="BMI", left=10, right=100)  # BMI is unlikely to go under 10 or exceed 100
        .interrogate()
    )
except Exception:
    print("Data Validation failed: Outlier or anomalous values detected")
################################## checking the value ranges for ordinal features
for f in ordinal_features: 
    temp_col = train_df[f]
    print(f"========================================== {f}")
    print(f"datatype: {temp_col.dtype}")
    print(temp_col.sort_values().value_counts().index)
========================================== GenHlth
datatype: int64
Index([2, 3, 1, 4, 5], dtype='int64', name='GenHlth')
========================================== MentHlth
datatype: int64
Index([ 0,  2, 30,  5,  1,  3, 10, 15,  4, 20,  7, 25, 14,  6,  8, 12, 28, 21,
       29, 16,  9, 18, 27, 22, 17, 26, 11, 23, 13, 24, 19],
      dtype='int64', name='MentHlth')
========================================== PhysHlth
datatype: int64
Index([ 0, 30,  2,  1,  3,  5, 10, 15,  7,  4, 20, 14, 25,  6,  8, 21, 12, 28,
       29,  9, 18, 16, 17, 27, 24, 13, 11, 22, 26, 23, 19],
      dtype='int64', name='PhysHlth')
========================================== Age
datatype: int64
Index([9, 10, 8, 7, 11, 6, 13, 5, 12, 4, 3, 2, 1], dtype='int64', name='Age')
========================================== Education
datatype: int64
Index([6, 5, 4, 3, 2, 1], dtype='int64', name='Education')
========================================== Income
datatype: int64
Index([8, 7, 6, 5, 4, 3, 2, 1], dtype='int64', name='Income')
########################## Data Validation: Correct category levels for Category/Ordinal Features
###### Done by defining acceptable value sets or ranges
## (based on the data collection method and domain knowledge)
################ If fails: Non-Critical -> raise warnings and continue
try:
    (
        pb.Validate(data=train_df)
        .col_vals_in_set(columns=binary_features, set=[0, 1])  # binary features: 0/1
        .col_vals_in_set(columns="GenHlth", set=list(range(1, 6)))  # scale of 1-5
        .col_vals_between(columns=["MentHlth", "PhysHlth"], left=0, right=30)  # number of days out of 30
        .col_vals_in_set(columns="Age", set=list(range(1, 14)))  # scale of 1-13
        .col_vals_in_set(columns="Education", set=list(range(1, 7)))  # scale of 1-6
        .col_vals_in_set(columns="Income", set=list(range(1, 9)))  # scale of 1-8
        .interrogate()
    )
except Exception:
    print("Data Validation failed: Incorrect category levels detected")
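The same range checks can be expressed with plain pandas, which is handy for quick spot checks outside the pointblank report. The three-row frame below is a made-up example with one deliberately out-of-range value per column.

```python
import pandas as pd

# Toy frame: the last row violates both range rules on purpose
df = pd.DataFrame({"BMI": [22, 31, 150], "GenHlth": [1, 3, 9]})

bmi_ok = df["BMI"].between(10, 100)       # BMI expected within [10, 100]
gen_ok = df["GenHlth"].isin(range(1, 6))  # GenHlth expected on a 1-5 scale
print(bmi_ok.tolist())  # [True, True, False]
print(gen_ok.tolist())  # [True, True, False]
```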
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 202944 entries, 180125 to 61332
Data columns (total 23 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   ID                    202944 non-null  int64
 1   Diabetes_binary       202944 non-null  int64
 2   HighBP                202944 non-null  int64
 3   HighChol              202944 non-null  int64
 4   CholCheck             202944 non-null  int64
 5   BMI                   202944 non-null  int64
 6   Smoker                202944 non-null  int64
 7   Stroke                202944 non-null  int64
 8   HeartDiseaseorAttack  202944 non-null  int64
 9   PhysActivity          202944 non-null  int64
 10  Fruits                202944 non-null  int64
 11  Veggies               202944 non-null  int64
 12  HvyAlcoholConsump     202944 non-null  int64
 13  AnyHealthcare         202944 non-null  int64
 14  NoDocbcCost           202944 non-null  int64
 15  GenHlth               202944 non-null  int64
 16  MentHlth              202944 non-null  int64
 17  PhysHlth              202944 non-null  int64
 18  DiffWalk              202944 non-null  int64
 19  Sex                   202944 non-null  int64
 20  Age                   202944 non-null  int64
 21  Education             202944 non-null  int64
 22  Income                202944 non-null  int64
dtypes: int64(23)
memory usage: 37.2 MB
from deepchecks.tabular import Dataset, Suite

deep_train = Dataset(train_df.drop(columns=['ID']),
                     label="Diabetes_binary",
                     cat_features=binary_features)
from deepchecks.tabular.checks import ClassImbalance, FeatureLabelCorrelation, FeatureFeatureCorrelation
import anywidget, ipywidgets

########################## Data Validation: Check for class imbalance, anomalous results between feature-feature or feature-label
### Class imbalance is expected for diabetes prediction, so it isn't treated as a warning about the dataset
### Feature-label: chose 0.5 as a threshold, given that this is variable health and lifestyle data and it would be unexpected to find a high correlation for any one feature
### Feature-feature: watches for multicollinearity; the threshold is set higher because it's reasonable for some features to be more correlated here

### Ian Gault: I looked up examples on ChatGPT-5 of how to use deepchecks for class imbalance and correlations and which modules they would be in. I found the syntax with Suite and implemented that style here. I was also running into errors about packages being synced or needed for deepchecks, so I looked up more information about those errors for debugging purposes.

suite = Suite(
    "Validation",
    ClassImbalance(),
    FeatureLabelCorrelation(correlation_threshold=0.5),
    FeatureFeatureCorrelation(correlation_threshold=0.7),
)

suite_result = suite.run(deep_train)

suite_result
[Deepchecks suite output (summarized): Class Imbalance {0: 0.86, 1: 0.14}; Feature Label Correlation: all scores below the 0.5 threshold, with BMI highest at about 0.025; Feature-Feature Correlation: all pairwise correlations below the 0.7 threshold, the strongest being PhysHlth vs DiffWalk at about 0.49, GenHlth vs DiffWalk at about 0.46, and Education vs Income at about 0.45.]

Data Visualization

# Check the imbalanced sample sizes of the two classes
alt.data_transformers.enable('vegafusion')

alt.Chart(train_df, title = "Number of Records of Two Classes").mark_bar().encode(
    x = "Diabetes_binary:N", 
    y = "count()"
).properties(
    width=150,
    height=250)
# Boxplot for Numeric Features
alt.Chart(train_df).mark_boxplot().encode(
    x=alt.X('Diabetes_binary:N', title='Diabetes (0/1)'),
    y=alt.Y(alt.repeat('row'), type='quantitative')
).properties(
    width=200,
    height=150
).repeat(
    row=numeric_features, 
)

# Patients with diabetes (Diabetes_binary = 1) have a higher BMI on average
# Bar Chart of Proportion with Diabetes for Binary Features
alt.Chart(train_df).mark_bar().transform_fold(
    binary_features,
    as_=['feature', 'value']
).encode(
    x=alt.X('value:N', title='0 or 1'),
    y=alt.Y('mean(Diabetes_binary):Q', title='Proportion with Diabetes'),
).properties(
    width=150, 
    height=150
).facet(
    facet='feature:N', 
    columns=5
)
# Bar Chart for Ordinal Features
alt.Chart(train_df).mark_bar(size=20).encode(
    x=alt.X(alt.repeat("row"),type="quantitative", sort="ascending"), 
    y="count()",
    color="Diabetes_binary:N",
    column=alt.Column("Diabetes_binary:N")
).properties(
    width=200, 
    height=150
).repeat(
    row=ordinal_features
)

Model Training

Feature Processing

dat.columns
Index(['ID', 'Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI',
       'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits',
       'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost',
       'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age',
       'Education', 'Income'],
      dtype='object')
# features (ordinal features are treated as numeric here so they are standardized alongside BMI)
numeric_feats = ["GenHlth", "Education", "Income", "Age", "MentHlth", "PhysHlth", "BMI"]


passthrough_feats = [
    "HighBP",
    "HighChol",
    "CholCheck",
    "Smoker",
    "Stroke",
    "HeartDiseaseorAttack",
    "PhysActivity",
    "Fruits",
    "Veggies",
    "HvyAlcoholConsump",
    "AnyHealthcare",
    "NoDocbcCost",
    "DiffWalk",
    "Sex"
]
from sklearn.compose import make_column_transformer

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_feats),
    ("passthrough", passthrough_feats)
)
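To see what this preprocessor does, here is a minimal sketch on a three-row toy frame: the numeric column is standardized while the 0/1 column passes through untouched. The toy values are made up for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Toy frame: "BMI" gets standardized, "HighBP" is passed through unchanged
toy = pd.DataFrame({"BMI": [20.0, 30.0, 40.0], "HighBP": [0, 1, 1]})
ct = make_column_transformer(
    (StandardScaler(), ["BMI"]),
    ("passthrough", ["HighBP"]),
)
out = ct.fit_transform(toy)
print(out)  # first column has mean 0, second column is still [0, 1, 1]
```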

Dummy Classifier

from sklearn.dummy import DummyClassifier

dummy_df = DummyClassifier(strategy="most_frequent", random_state=522)

scores_dummy = pd.DataFrame(cross_validate(dummy_df, X_train, y_train, return_train_score=True)).mean()
scores_dummy
fit_time       0.009452
score_time     0.001883
test_score     0.860922
train_score    0.860922
dtype: float64
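The 0.86 baseline score follows directly from class prevalence: a most-frequent classifier's accuracy equals the majority-class share. A tiny sketch with a made-up 86/14 label vector:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# 86 negatives and 14 positives; features are ignored by "most_frequent"
y = np.array([0] * 86 + [1] * 14)
X = np.zeros((100, 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = dummy.score(X, y)
print(acc)  # 0.86, the majority-class share
```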

Logistic Regression

lr_pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

scores_logistic = cross_validate(lr_pipe, X_train, y_train, return_train_score=True)
results = pd.DataFrame(scores_logistic)
results.mean()
fit_time       0.165687
score_time     0.004974
test_score     0.863731
train_score    0.863839
dtype: float64

Linear SVC

from sklearn.svm import LinearSVC

linear_svc_pipe = make_pipeline(preprocessor, LinearSVC(max_iter=5000))

scores = cross_validate(linear_svc_pipe, X_train, y_train, return_train_score=True)
results = pd.DataFrame(scores)
results.mean()
fit_time       0.412132
score_time     0.005083
test_score     0.863539
train_score    0.863546
dtype: float64

Final Test (predict on the test set)

from sklearn.metrics import accuracy_score

lr_pipe.fit(X_train, y_train)
prediction_lr = lr_pipe.predict(X_test)
accuracy_lr = accuracy_score(y_test, prediction_lr)

linear_svc_pipe.fit(X_train, y_train)
prediction_svc = linear_svc_pipe.predict(X_test)
accuracy_svc = accuracy_score(y_test, prediction_svc)
print(f"The accuracy of the Logistic Regression model is {accuracy_lr}")
print(f"The accuracy of the Linear SVC model is {accuracy_svc}")
The accuracy of the Logistic Regression model is 0.8627207505518764
The accuracy of the Linear SVC model is 0.8632726269315674
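Threshold tuning, mentioned in the summary, would be a natural follow-up: logistic regression exposes per-person probabilities, and lowering the decision cutoff trades precision for recall. A minimal sketch on synthetic data with a roughly 86/14 split (the features and sample size are illustrative assumptions, not the BRFSS data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with a ~86/14 class split, standing in for the BRFSS sample
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.86], random_state=522)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=522)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

proba = model.predict_proba(X_te)[:, 1]  # estimated P(positive) per row
recalls = {}
for thresh in (0.5, 0.3, 0.1):
    pred = (proba >= thresh).astype(int)  # flag everyone above the cutoff
    recalls[thresh] = recall_score(y_te, pred)
    print(f"threshold={thresh}: recall={recalls[thresh]:.2f}")
```

Lowering the cutoff can only add predicted positives, so recall is non-decreasing as the threshold drops; for a screening tool, catching more true cases at the cost of extra false alarms may be the right trade-off.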

Conclusion

After training, Logistic Regression and Linear SVC produced similar accuracy on X_test (about 86%), with Logistic Regression training faster. Given the small difference, either model could be chosen for further evaluation; if speed and interpretability/probability estimates are important, it would make sense to go with Logistic Regression.

A higher-priority next step is addressing class imbalance and re-evaluating both models to see if they outperform the dummy classifier. This motivates deeper EDA, examining feature distributions and predictions, reviewing confusion matrices, and conducting hyperparameter tuning to test for potential improvements. At this point, we cannot draw firm conclusions about the models’ predictive ability based on the current dataset and features.
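As a sketch of the class-imbalance next step, scikit-learn's class_weight="balanced" option reweights the minority class during fitting. The example below uses synthetic data with a roughly 86/14 split as a stand-in; the numbers are illustrative, not results from this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with a ~86/14 class split, standing in for the BRFSS sample
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.86], random_state=522)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=522)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

pred_plain = plain.predict(X_te)
pred_bal = balanced.predict(X_te)
r_plain = recall_score(y_te, pred_plain)
r_bal = recall_score(y_te, pred_bal)
print(f"recall plain={r_plain:.2f} balanced={r_bal:.2f}")
```

The balanced model flags more minority-class cases, which typically raises recall on the diabetic class; whether the precision cost is acceptable is exactly the evaluation question raised above.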